29 research outputs found

    Unbiased Estimation for Linear Regression When n < v

    High-dimensional Linear Regression Problems via Graphical Models

    This thesis introduces a new method for solving the linear regression problem where the number of observations n is smaller than the number of variables (predictors) v. In contrast to existing methods such as ridge regression, Lasso and Lars, the proposed method uses the idea of graphical models and provides unbiased parameter estimates under certain conditions. In addition, the new method provides a detailed graphical conditional correlation structure for the predictors, from which the real causal relationships between predictors can be identified. Furthermore, the proposed method is extended into a hybrid with ridge regression to improve efficiency in computation and model selection. In the extended method, less important variables are regularised by a ridge-type penalty, and the model search is restricted to the important covariates. This significantly reduces the computational cost while still giving unbiased estimates for the important variables and increasing the efficiency of model selection. Moreover, the extended method is applied to the portfolio selection problem within the Markowitz mean-variance framework with n < v. Various simulations and real data analyses were conducted to compare the two novel methods with the aforementioned existing methods. Our experiments indicate that the new methods outperform all the other methods when n < v.
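
    As a minimal, hypothetical sketch of the selective-penalty idea in the extended method (a ridge-type penalty on the less important covariates only, with the important covariates left unpenalised), assuming the split into important and less important variables is already known; this is not the graphical estimation procedure itself, and all names and sizes below are illustrative:

        import numpy as np

        def selective_ridge(X, y, important_idx, lam=1.0):
            """Least squares with a ridge-type penalty applied only to the
            covariates NOT listed in important_idx: a sketch of penalising the
            less important variables while leaving the important ones unpenalised."""
            n, v = X.shape
            D = np.ones(v)
            D[important_idx] = 0.0                    # no penalty on the important covariates
            # Solve (X'X + lam * diag(D)) beta = X'y
            return np.linalg.solve(X.T @ X + lam * np.diag(D), X.T @ y)

        # Toy n < v example: 30 observations, 100 predictors, the first 5
        # (hypothetically) treated as the important ones.
        rng = np.random.default_rng(0)
        n, v = 30, 100
        X = rng.standard_normal((n, v))
        beta_true = np.zeros(v)
        beta_true[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]
        y = X @ beta_true + 0.1 * rng.standard_normal(n)

        beta_hat = selective_ridge(X, y, important_idx=np.arange(5), lam=10.0)
        print(np.round(beta_hat[:5], 2))              # estimates for the important covariates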

    Hybrid Graphical Least Square Estimation and its application in Portfolio Selection

    This paper proposes a new regression method based on the idea of graphical models to deal with regression problems in which the number of covariates v is larger than the sample size N. Unlike regularisation methods such as ridge regression, LASSO and LARS, which always give biased estimates for all parameters, the proposed method can give unbiased estimates for the important parameters (a certain subset of all parameters). The new method is applied to a portfolio selection problem under the linear regression framework and, compared to other existing methods, it can improve portfolio performance by increasing the expected return and decreasing the risk. Another advantage of the proposed method is that it constructs a non-sparse (saturated) portfolio, which is more diversified across stocks and so reduces stock-specific risk. Overall, four simulation studies and a real data analysis from the London Stock Exchange showed that our method outperforms other existing regression methods when N < v.
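
    For context, a small sketch of why N < v is awkward in the Markowitz mean-variance setting: with fewer return observations than stocks, the sample covariance matrix is singular, so the classical minimum-variance weights cannot be computed directly and some regularised or regression-based estimate is needed. The sizes and the ridge shrinkage below are purely illustrative and are not the method proposed in the paper:

        import numpy as np

        rng = np.random.default_rng(1)
        N, v = 60, 150                            # hypothetical: 60 return observations, 150 stocks
        R = 0.01 * rng.standard_normal((N, v))    # toy daily-return matrix

        Sigma = np.cov(R, rowvar=False)           # v x v sample covariance
        print("rank:", np.linalg.matrix_rank(Sigma), "of", v)   # rank <= N - 1 < v

        # The classical minimum-variance weights w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)
        # require an invertible Sigma, which fails here; a simple (illustrative)
        # work-around is a small ridge shrinkage of the covariance.
        lam = 1e-4
        Sigma_reg = Sigma + lam * np.eye(v)
        ones = np.ones(v)
        w = np.linalg.solve(Sigma_reg, ones)
        w /= ones @ w
        print("weights sum to:", round(float(w.sum()), 6))
        print("portfolio variance:", float(w @ Sigma_reg @ w))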

    An Optimal k Nearest Neighbours Ensemble for Classification Based on Extended Neighbourhood Rule with Features subspace

    To minimise the effect of outliers, kNN ensembles estimate the unknown class of a new sample point by majority voting over the labels of the closest training observations. Ordinary kNN-based procedures determine the k closest training observations in the neighbourhood region (enclosed by a sphere) using a distance formula. This may fail when the test points follow the pattern of nearby observations that lie along a path not contained in the sphere of nearest neighbours. Furthermore, these methods combine hundreds of base kNN learners, many of which might have high classification errors, thereby resulting in poor ensembles. To overcome these problems, an optimal extended neighbourhood rule based ensemble is proposed in which the neighbours are determined in k steps. The rule starts from the training point nearest to the unseen observation; the next data point selected is the one closest to the previously selected point, and this process continues until the required k observations are obtained. Each base model in the ensemble is constructed on a bootstrap sample in conjunction with a random subset of features. After building a sufficiently large number of base models, the optimal models are selected based on their performance on out-of-bag (OOB) data.
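
    A minimal sketch of the extended neighbourhood rule as described above: the neighbourhood is grown in k steps, each step taking the unused training point closest to the previously selected one, and the class is decided by majority vote. Function names and data are illustrative, and the bootstrap/feature-subset ensemble with OOB model selection is omitted for brevity:

        import numpy as np
        from collections import Counter

        def enr_neighbours(X_train, x_query, k):
            """Pick k neighbours by the extended neighbourhood rule: start from the
            training point nearest to the query, then repeatedly take the unused
            training point nearest to the last selected one."""
            remaining = list(range(len(X_train)))
            current = min(remaining, key=lambda i: np.linalg.norm(X_train[i] - x_query))
            path = [current]
            remaining.remove(current)
            for _ in range(k - 1):
                current = min(remaining,
                              key=lambda i: np.linalg.norm(X_train[i] - X_train[path[-1]]))
                path.append(current)
                remaining.remove(current)
            return path

        def enr_predict(X_train, y_train, x_query, k=5):
            """Majority vote over the labels of the k ENR neighbours."""
            idx = enr_neighbours(X_train, x_query, k)
            return Counter(y_train[i] for i in idx).most_common(1)[0][0]

        # Tiny usage example with made-up data.
        rng = np.random.default_rng(2)
        X_train = rng.standard_normal((50, 4))
        y_train = (X_train[:, 0] > 0).astype(int)
        print(enr_predict(X_train, y_train, rng.standard_normal(4), k=5))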

    Graphical group ridge

    Optimal model selection for k-nearest neighbours ensemble via sub-bagging and sub-sampling with feature weighting

    This paper proposes two novel approaches, based on feature weighting and model selection, for building more accurate kNN ensembles. The first approach identifies the nearest observations using a feature weighting scheme, with weights derived from the response variable via support vectors. A randomly selected subset of features is used for the feature weighting and model construction. After building a sufficiently large number of base models on bootstrap samples, a subset of the models is selected, based on out-of-bag prediction error, for the final ensemble. The second approach builds base learners on random subsamples instead of bootstrap samples, again with a random subset of features, and likewise uses feature weighting while building the models. The observations left out of each subsample are used to assess the corresponding base learner and to select a subset of the models for the final ensemble. The suggested ensemble methods are assessed on 12 benchmark datasets against other classical methods, including kNN-based models. The analyses reveal that the proposed methods are often better than the others.
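
    As a rough sketch of the shared bootstrap-plus-OOB model-selection step (the support-vector feature weighting specific to the paper is omitted): fit many kNN base learners on bootstrap samples with random feature subsets, score each on its out-of-bag rows, and keep only the best-scoring learners for the final majority vote. The function names, ensemble size and retained fraction below are assumptions for illustration:

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier

        def build_and_select(X, y, n_models=100, n_features=5, k=5, keep=20, seed=0):
            """Fit kNN base learners on bootstrap samples with random feature
            subsets, score each on its out-of-bag rows, and keep the best `keep`."""
            rng = np.random.default_rng(seed)
            n, p = X.shape
            models = []
            for _ in range(n_models):
                rows = rng.integers(0, n, n)                 # bootstrap sample
                oob = np.setdiff1d(np.arange(n), rows)       # out-of-bag rows
                feats = rng.choice(p, size=min(n_features, p), replace=False)
                clf = KNeighborsClassifier(n_neighbors=k).fit(X[rows][:, feats], y[rows])
                score = clf.score(X[oob][:, feats], y[oob]) if len(oob) else 0.0
                models.append((score, clf, feats))
            models.sort(key=lambda m: m[0], reverse=True)
            return models[:keep]                             # best base learners only

        def ensemble_predict(selected, X_new):
            """Majority vote of the selected base learners (assumes non-negative
            integer class labels)."""
            votes = np.stack([clf.predict(X_new[:, feats]) for _, clf, feats in selected])
            return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

        # Tiny usage example with made-up data and two integer class labels.
        rng = np.random.default_rng(3)
        X = rng.standard_normal((120, 10))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        selected = build_and_select(X, y)
        print(ensemble_predict(selected, X[:5]))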

    Box-plots of the error rates produced by random forest, using the top 10 features selected by different feature selection methods for the Colon dataset.

    Box-plots of the error rates produced by random forest, using the top 10 features selected by different feature selection methods for the TumorC dataset.

    Brief description of the datasets along with the corresponding number of features, observations, class-wise distributions and sources.

    Bar-plots of the error rates of the proposed and other classical methods on various subsets of genes for the Lungcancer dataset.
